ONLINE VARIANTS OF THE CROSS-ENTROPY METHOD

Abstract. The cross-entropy method [2] is a simple but efficient method for global
optimization. In this paper we provide two online variants of the basic CEM, together
with a proof of convergence.



1. Introduction 

It is well known that the cross-entropy method (CEM) has similarities [2] to many
other selection-based methods, such as genetic algorithms, estimation-of-distribution
algorithms, ant colony optimization, and maximum likelihood parameter estimation. In
this paper we provide two online variants of the basic CEM. The online variants reveal
similarities to several other optimization methods, such as stochastic gradient and
simulated annealing. However, it is not our aim to analyze the similarities and
differences between these methods, nor to argue that one method is superior to another.
Instead, we provide asymptotic convergence results for the new online CE variants.

2. The algorithms 

2.1. The basic CE method. The cross-entropy method is shown in Figure 1. For an
explanation of the algorithm and its derivation, see e.g. [2]. Extensions of the method
allow various generalizations, e.g., decreasing $\alpha$, varying population size, added
noise, etc. In this paper we restrict our attention to the basic algorithm.

2.2. CEM for the combinatorial optimization task. Consider the following problem:

The combinatorial optimization task. Let $n \in \mathbb{N}$, $D = \{0,1\}^n$ and
$f : D \to \mathbb{R}$. Find a vector $x^* \in D$ such that
$x^* = \arg\min_{x \in D} f(x)$.

To apply the CE method to this problem, let the distribution $g$ be the product of $n$
independent Bernoulli distributions with parameter vector $p_t \in [0,1]^n$ and set the
initial parameter vector to $p_0 = (1/2, \ldots, 1/2)$. For Bernoulli distributions, the
parameter update is done by the simple procedure shown in Figure 2.
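As an illustration, the basic method with the Bernoulli update can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function names `ce_batch_update` and `batch_cem`, the default parameter values, and the test objective (the number of ones in the vector) are chosen here for illustration only. Following Figure 1, the samples with the largest $f$-values are treated as elite; ties at the threshold are cut so that exactly $\lceil \rho \cdot N \rceil$ samples enter the update.

```python
import math
import random


def ce_batch_update(elite, p, alpha):
    """Smoothed Bernoulli update: move p towards the mean of the elite samples."""
    n_b = len(elite)
    mean = [sum(x[i] for x in elite) / n_b for i in range(len(p))]
    return [(1 - alpha) * p_i + alpha * m_i for p_i, m_i in zip(p, mean)]


def batch_cem(f, n, N=100, rho=0.1, alpha=0.7, T=50, seed=0):
    """Basic CE method on {0,1}^n; defaults are illustrative assumptions."""
    rng = random.Random(seed)
    p = [0.5] * n                          # p_0 = (1/2, ..., 1/2)
    n_elite = math.ceil(rho * N)
    for _ in range(T):
        # draw N samples from the product of Bernoulli distributions
        samples = [[1 if rng.random() < p_i else 0 for p_i in p]
                   for _ in range(N)]
        # sort in descending order of f; the first n_elite samples are elite
        scored = sorted(samples, key=f, reverse=True)
        elite = scored[:n_elite]
        p = ce_batch_update(elite, p, alpha)
    return p


if __name__ == "__main__":
    # hypothetical objective: number of ones in the vector
    print(batch_cem(sum, n=10))
```

With this objective the parameter vector is expected to drift towards the all-ones vector, since samples containing more ones are selected as elite.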

2.3. Online CEM. The basic algorithm performs batch updates: the sampling distribution
is updated only once after drawing and evaluating $N$ samples. We shall transform this
algorithm into an online one. Batch processing is used in two steps of the algorithm:

• in the update of the distribution $g_t$, and

    % inputs:
    %   population size $N$
    %   selection ratio $\rho$
    %   smoothing factor $\alpha$
    %   number of iterations $T$

    $p_0$ := initial distribution parameters
    for $t$ from 0 to $T-1$,
        % draw N samples and evaluate them
        for $i$ from 1 to $N$,
            draw $x^{(i)}$ from distribution $g(p_t)$
            $f_i := f(x^{(i)})$
        sort $\{(x^{(i)}, f_i)\}$ in descending order w.r.t. $f_i$
        % compute new elite threshold level
        $\gamma_{t+1} := f_{\lceil \rho \cdot N \rceil}$
        % get elite samples
        $E_{t+1} := \{x^{(i)} \mid f_i \ge \gamma_{t+1}\}$
        $p_{t+1}$ := CEBatchUpdate($E_{t+1}$, $p_t$, $\alpha$)
    end loop



Figure 1. The basic cross-entropy method. 



    procedure $p_{t+1}$ := CEBatchUpdate($E$, $p_t$, $\alpha$)
    % $E$: set of elite samples
    % $p_t$: current parameter vector
    % $\alpha$: smoothing factor
    $N_b := \lceil \rho \cdot N \rceil$
    $p' := (\sum_{x \in E} x) / N_b$
    $p_{t+1} := (1 - \alpha) \cdot p_t + \alpha \cdot p'$



Figure 2. The batch cross-entropy update for Bernoulli parameters. 



• when the elite threshold is computed (which includes the sorting of the $N$ samples
of the last episode).

As a first step, note that the contribution of a single sample to the distribution update
is $\alpha_1 := \alpha / \lceil \rho \cdot N \rceil$ if the sample is contained in the
elite set, and zero otherwise. We can perform this update immediately after generating the
sample, provided that we know whether it is an elite sample or not. In the batch algorithm,
deciding this requires waiting until the end of the episode. However, with a small
modification we can get an answer immediately: we can check whether the new sample is among
the top $\rho$ fraction of the last $N$ samples. This corresponds to a sliding window of
length $N$. Algorithmically, we can implement this as a queue $Q$ with at most $N$
elements. The algorithm is summarized in Figure 3.

    % inputs:
    %   window size $N$
    %   selection ratio $\rho$
    %   smoothing factor $\alpha$
    %   number of samples $K$

    $p_0$ := initial distribution parameters
    $Q := \{\}$
    for $t$ from 0 to $K-1$,
        % draw one sample and evaluate it
        draw $x^{(t)}$ from distribution $g(p_t)$
        $f_t := f(x^{(t)})$
        % add sample to queue
        $Q := Q \cup \{(t, x^{(t)}, f_t)\}$
        if LengthOf($Q$) > $N$, % no updates until we have collected N samples
            delete oldest element of $Q$
            % compute new elite threshold level
            $\{f'_i\}$ := sort $f$-values in $Q$ in descending order
            $\gamma_{t+1} := f'_{\lceil \rho \cdot N \rceil}$
            if $f(x^{(t)}) \ge \gamma_{t+1}$ then
                % $x^{(t)}$ is an elite sample
                $p_{t+1}$ := CEOnlineUpdate($x^{(t)}$, $p_t$, $\alpha / \lceil \rho \cdot N \rceil$)
            endif
        endif
    end loop



Figure 3. Online cross-entropy method, first variant. 

For Bernoulli distributions, the parameter update is done by the simple procedure
shown in Figure 4.



    procedure $p_{t+1}$ := CEOnlineUpdate($x$, $p_t$, $\alpha_1$)
    % $x$: elite sample
    % $p_t$: current parameter vector
    % $\alpha_1$: stepsize
    $p_{t+1} := (1 - \alpha_1) \cdot p_t + \alpha_1 \cdot x$



Figure 4. The online cross-entropy update for Bernoulli parameters. 

Note that the behavior of this modified algorithm is slightly different from the batch
version, as the following example highlights: suppose that the population size is
$N = 100$, and we have just drawn the 114th sample. In the batch version, we will check
whether this sample belongs to the elite of the set $\{x^{(101)}, \ldots, x^{(200)}\}$
(after all of these samples are known), while in the online version, it is checked against
the set of the last 100 samples, $\{x^{(15)}, \ldots, x^{(114)}\}$ (which is known
immediately).
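The sliding-window variant can be sketched in Python as follows. This is an illustrative sketch only: the names `ce_online_update` and `online_cem`, the default parameters, and the test objective (the number of ones) are assumptions made here, and only the $f$-values are kept in the queue because the threshold computation needs nothing else. The window is re-sorted at every step for clarity rather than speed.

```python
import math
import random
from collections import deque


def ce_online_update(x, p, alpha1):
    """Move the Bernoulli parameters a small step towards one elite sample."""
    return [(1 - alpha1) * p_i + alpha1 * x_i for p_i, x_i in zip(p, x)]


def online_cem(f, n, N=100, rho=0.1, alpha=0.7, K=5000, seed=0):
    """Sliding-window online CE sketch; defaults are illustrative assumptions."""
    rng = random.Random(seed)
    p = [0.5] * n
    n_elite = math.ceil(rho * N)
    alpha1 = alpha / n_elite             # per-sample stepsize
    window = deque()                     # f-values of the most recent samples
    for _ in range(K):
        x = [1 if rng.random() < p_i else 0 for p_i in p]
        fx = f(x)
        window.append(fx)
        if len(window) > N:              # no updates until N samples are collected
            window.popleft()             # drop the oldest sample
            # elite threshold: the n_elite-th largest f-value in the window
            gamma = sorted(window, reverse=True)[n_elite - 1]
            if fx >= gamma:              # the new sample is elite
                p = ce_online_update(x, p, alpha1)
    return p


if __name__ == "__main__":
    # hypothetical objective: number of ones in the vector
    print(online_cem(sum, n=10))
```

Note that each elite sample moves the parameters by only $\alpha_1$, so many more iterations are needed than in the batch sketch, but each iteration draws a single sample.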



2.4. Online CEM, memoryless version. The sliding window online CEM algorithm
(Fig. 3) is fully incremental in the sense that each sample is processed immediately, and
the per-sample processing time does not increase with increasing $t$. However, processing
time (and required memory) does depend on the size of the sliding window $N$: in order to
determine the elite threshold level $\gamma_t$, we have to store the last $N$ samples and
sort them (see the footnote below). In some applications (for example, when a
connectionist implementation is sought), this requirement is undesirable. We shall
simplify the algorithm further, so that both the memory requirement and the processing
time are constant. This simplification comes at a cost: the performance of the new variant
will depend on the range and distribution of the sample values.
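To make the storage and sorting cost of the sliding window concrete, the following Python sketch keeps the window sorted incrementally instead of re-sorting it at every step. The class name `SortedWindow` and its methods are hypothetical and introduced only for this illustration. Binary search locates the insertion point in $O(\log N)$ comparisons; with a plain Python list the insertion itself still shifts elements, so a balanced tree or similar structure would be needed for a genuinely logarithmic update.

```python
import bisect
from collections import deque


class SortedWindow:
    """Sliding window over the last `size` f-values, kept sorted ascending."""

    def __init__(self, size):
        self.size = size
        self.arrival = deque()   # values in arrival order
        self.ordered = []        # the same values, sorted in ascending order

    def push(self, value):
        """Insert a new value and drop the oldest one if the window is full."""
        self.arrival.append(value)
        bisect.insort(self.ordered, value)       # binary search + list insert
        if len(self.arrival) > self.size:
            oldest = self.arrival.popleft()
            del self.ordered[bisect.bisect_left(self.ordered, oldest)]

    def kth_largest(self, k):
        """Return the k-th largest value in the window (k = 1 is the maximum)."""
        return self.ordered[-k]


if __name__ == "__main__":
    w = SortedWindow(3)
    for v in [5, 1, 4, 2, 9]:
        w.push(v)
    print(w.ordered, w.kth_largest(2))
```

In the sliding-window algorithm, the elite threshold would be read off as `kth_largest` with $k = \lceil \rho \cdot N \rceil$.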

Consider now the sample at position $N_e = \lceil \rho \cdot N \rceil$, the value of which
determines the threshold. The key observation is that its position cannot change
arbitrarily in a single step. First of all, there is a small chance that it will be
removed from the queue as the oldest sample. Neglecting this small-probability event, the
position of the threshold sample can either jump up or down one place or remain unchanged.
More precisely, there are four possible cases, depending on (1) whether the new sample
belongs to the elite and (2) whether the sample that just drops out of the queue belonged
to the elite:

(A) both the new sample and the dropout sample are elite. The threshold position
remains unchanged. So does the threshold level, except with a small probability
when the new or the dropout sample was exactly at the boundary. We will
ignore this small-probability event.

(B) the new sample is elite but the dropout sample is not. The threshold level
increases to $\gamma_{t+1} := \gamma_t + f_{N_e+1} - f_{N_e}$ (ignoring a
low-probability event).

(C) neither the new sample nor the dropout sample is elite. The threshold remains
unchanged (with high probability).

(D) the new sample is not elite but the dropout sample is. The threshold level
decreases to $\gamma_{t+1} := \gamma_t + f_{N_e-1} - f_{N_e}$.

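[Footnote to the first paragraph of Section 2.4: Processing time can be reduced to
$O(\log N)$ if insertion sort is used: in each step, there is only one new element to be
inserted into the sorted queue.]

Let $\mathcal{F}_t$ denote the $\sigma$-algebra generated by knowing all random outcomes
up to time step $t$. Assuming that the positions of the new sample and the dropout sample
are distributed uniformly, we get that

\[
\begin{aligned}
E(\gamma_{t+1} \mid \mathcal{F}_t, \text{new sample is elite})
  &= \gamma_t + \Pr(\text{case A}) \cdot 0
     + \Pr(\text{case B}) \cdot E(f_{N_e+1} - f_{N_e} \mid \mathcal{F}_t) \\
  &\approx \gamma_t + (1 - \rho) \cdot E(f_{N_e+1} - f_{N_e} \mid \mathcal{F}_t) \\
  &= \gamma_t + (1 - \rho) \cdot \Delta_t,
\end{aligned}
\]

where we introduced the notation $\Delta_t = E(f_{N_e+1} - f_{N_e} \mid \mathcal{F}_t)$.
Similarly,

\[
\begin{aligned}
E(\gamma_{t+1} \mid \mathcal{F}_t, \text{new sample is not elite})
  &= \gamma_t + \Pr(\text{case C}) \cdot 0
     + \Pr(\text{case D}) \cdot E(f_{N_e-1} - f_{N_e} \mid \mathcal{F}_t) \\
  &\approx \gamma_t + \rho \cdot E(f_{N_e-1} - f_{N_e} \mid \mathcal{F}_t) \\
  &\approx \gamma_t - \rho \cdot \Delta_t,
\end{aligned}
\]

using the approximation that $E(f_{N_e-1} - f_{N_e} \mid \mathcal{F}_t) \approx
-E(f_{N_e+1} - f_{N_e} \mid \mathcal{F}_t) = -\Delta_t$.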
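The two conditional expectations can be combined into a quick sanity check of the resulting update rule. Writing $q_t$ for the probability that the new sample is elite (a symbol introduced here only for illustration), and using the same approximations as above, the unconditional expected change of the threshold is

```latex
\begin{aligned}
E(\gamma_{t+1} - \gamma_t \mid \mathcal{F}_t)
  &\approx q_t \,(1 - \rho)\, \Delta_t \;-\; (1 - q_t)\, \rho\, \Delta_t \\
  &= (q_t - \rho)\, \Delta_t .
\end{aligned}
```

Under these approximations, the expected change is proportional to $q_t - \rho$ and vanishes exactly when a fraction $\rho$ of the samples is elite, which is the property that a threshold placed at the $\rho$-quantile should have.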

$\Delta_t$ can drift as $t$ grows, and its exact value cannot be computed without storing
the $f$-values. Therefore, we have to use some approximation. We present the following
possibilities:

(1) use a constant stepsize $\Delta$. Clearly, this approximation works best if the
distribution of $f$-value differences does not change much during the optimization
process.

(2) assume that function values are distributed uniformly over an interval $[a, b]$. In
this case, $\Delta_t = (b - a)/(N + 1)$. On the other hand, let
$D_t = E(|f(x^{(t)}) - f(x^{(t+1)})|)$. $f(x^{(t)})$ and $f(x^{(t+1)})$ are independent,
uniformly distributed samples, so we obtain $D_t = (b - a)/3$, i.e.,
$\Delta_t = \Delta_0^{\mathrm{uniform}} D_t$ with
$\Delta_0^{\mathrm{uniform}} = 3/(N + 1)$. From this, we can obtain an online
approximation scheme

\[
\Delta_{t+1} := (1 - \beta)\,\Delta_t
  + \beta \cdot \Delta_0^{\mathrm{uniform}}\, |f(x^{(t)}) - f(x^{(t+1)})|,
\]

where $\beta$ is an exponential forgetting parameter.

(3) assume that function values have a normal distribution $\mathcal{N}(\mu, \sigma^2)$.
In this case, $\Delta_t = \sigma\,(\Phi^{-1}(1 - \rho + \tfrac{1}{N}) - \Phi^{-1}(1 - \rho))$,
where $\Phi$ is the cumulative distribution function of the standard normal distribution.
On the other hand, let $D_t = E(|f(x^{(t)}) - f(x^{(t+1)})|)$. $f(x^{(t)})$ and
$f(x^{(t+1)})$ are independent, normally distributed samples, so we obtain
$D_t = 2\sigma/\sqrt{\pi}$, i.e., $\Delta_t = \Delta_0^{\mathrm{Gauss}} D_t$ with
$\Delta_0^{\mathrm{Gauss}} = \tfrac{\sqrt{\pi}}{2}\,(\Phi^{-1}(1 - \rho + \tfrac{1}{N}) - \Phi^{-1}(1 - \rho))$.
From this, we can obtain an online approximation scheme

\[
\Delta_{t+1} := (1 - \beta)\,\Delta_t
  + \beta \cdot \Delta_0^{\mathrm{Gauss}}\, |f(x^{(t)}) - f(x^{(t+1)})|,
\]

where $\beta$ is an exponential forgetting parameter.

(4) we can obtain a similar approximation for many other distributions of the $f$-values,
but the constant $\Delta_0$ does not necessarily have an easy-to-compute form.
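The constants and the forgetting scheme from options (2) and (3) are easy to evaluate numerically. The sketch below is illustrative only: the function names `delta0_uniform`, `delta0_gauss` and `update_delta` are hypothetical, and the Gaussian case relies on the quantile function `inv_cdf` of `statistics.NormalDist` from the Python standard library (Python 3.8 or later) as $\Phi^{-1}$.

```python
import math
from statistics import NormalDist


def delta0_uniform(N):
    """Ratio Delta_t / D_t when the f-values are uniform on an interval."""
    return 3.0 / (N + 1)


def delta0_gauss(N, rho):
    """Ratio Delta_t / D_t when the f-values are normally distributed."""
    inv_cdf = NormalDist().inv_cdf        # standard normal quantile function
    gap = inv_cdf(1 - rho + 1 / N) - inv_cdf(1 - rho)
    return math.sqrt(math.pi) / 2 * gap


def update_delta(delta, beta, delta0, f_now, f_next):
    """One step of the exponentially forgetting estimate of Delta."""
    return (1 - beta) * delta + beta * delta0 * abs(f_now - f_next)


if __name__ == "__main__":
    print(delta0_uniform(100), delta0_gauss(100, 0.1))
    print(update_delta(1.0, 0.5, delta0_uniform(99), 2.0, 4.0))
```

Note that `delta0_gauss` requires $0 < 1 - \rho + 1/N < 1$, i.e. $\rho > 1/N$, because the quantile function is only defined on the open unit interval.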

The resulting algorithm using option (1) is summarized in Fig. 5.
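A minimal Python sketch of the memoryless variant with a constant stepsize $\Delta$ (option (1)) is given below. The function name `memoryless_cem`, the parameter `gamma0` for the arbitrary initial threshold, the default values, and the test objective (the number of ones) are assumptions made for illustration. The function also returns the observed fraction of elite samples, which makes the threshold-tracking behavior visible.

```python
import math
import random


def memoryless_cem(f, n, N=100, rho=0.1, alpha=0.7, delta=0.05,
                   gamma0=0.0, K=20000, seed=0):
    """Memoryless online CE sketch; returns (parameters, elite fraction)."""
    rng = random.Random(seed)
    p = [0.5] * n
    alpha1 = alpha / math.ceil(rho * N)   # per-sample stepsize
    gamma = gamma0                        # the initial threshold is arbitrary
    n_elite = 0
    for _ in range(K):
        x = [1 if rng.random() < p_i else 0 for p_i in p]
        if f(x) >= gamma:
            # elite sample: raise the threshold and update the distribution
            gamma += (1 - rho) * delta
            p = [(1 - alpha1) * p_i + alpha1 * x_i for p_i, x_i in zip(p, x)]
            n_elite += 1
        else:
            # non-elite sample: lower the threshold
            gamma -= rho * delta
    return p, n_elite / K


if __name__ == "__main__":
    # hypothetical objective: number of ones in the vector
    params, elite_fraction = memoryless_cem(sum, n=10)
    print(params, elite_fraction)
```

Because every elite sample raises the threshold by $(1-\rho)\Delta$ and every other sample lowers it by $\rho\Delta$, the threshold moves by $\Delta \cdot (\text{number of elite samples} - \rho K)$ in total. As long as the threshold stays within a bounded range, the elite fraction therefore stays close to $\rho$ without any stored samples.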

3. Convergence analysis 

In this section we show that despite the various approximations used, the three variants
of the CE method possess the same asymptotic convergence properties. Naturally,
the actual performance of these algorithms may differ from each other.

3.1. The classical CE method. First, we review the results of Costa et al. [1] on
the convergence of the classical CE method.

Theorem 3.1. If the basic CE method is used for combinatorial optimization with
smoothing factor $\alpha$, $\rho > 0$ and $p_{0,i} \in (0, 1)$ for each
$i \in \{1, \ldots, n\}$, then $p_t$ converges to a 0/1 vector with probability 1. The
probability that the optimal solution is generated during the process can be made
arbitrarily close to 1 if $\alpha$ is sufficiently small.

    % inputs:
    %   window size $N$
    %   selection ratio $\rho$
    %   smoothing factor $\alpha$
    %   number of samples $K$

    $p_0$ := initial distribution parameters
    $\gamma_0$ := arbitrary
    for $t$ from 0 to $K-1$,
        % draw one sample and evaluate it
        draw $x^{(t)}$ from distribution $g(p_t)$
        if $f(x^{(t)}) \ge \gamma_t$ then
            % $x^{(t)}$ is an elite sample
            % compute new elite threshold level
            $\gamma_{t+1} := \gamma_t + (1 - \rho) \cdot \Delta$
            $p_{t+1}$ := CEOnlineUpdate($x^{(t)}$, $p_t$, $\alpha / \lceil \rho \cdot N \rceil$)
        else
            % compute new elite threshold level
            $\gamma_{t+1} := \gamma_t - \rho \cdot \Delta$
        endif
        % optional step: update $\Delta$
        % $\Delta := (1 - \beta)\Delta + \beta \cdot \Delta_0 |f(x^{(t)}) - f(x^{(t-1)})|$
    end loop



Figure 5. Online cross-entropy method, memoryless variant.

The statements of the theorem are rather weak, and are not specific to the particular
form of the algorithm: basically they state that (1) the algorithm is a "trapped random
walk": the probabilities may change up and down, but eventually they converge to one
of the two absorbing values, 0 or 1; and (2) if the random walk can last for a
sufficiently long time, then the optimal solution is sampled with high probability. We
shall transfer the proof to the other two algorithms below.

3.2. The online CE methods. 

Theorem 3.2. If either variant of the online CE method is used for combinatorial
optimization with smoothing factor $\alpha$, $\rho > 0$ and $p_{0,i} \in (0, 1)$ for each
$i \in \{1, \ldots, n\}$, then $p_t$ converges to a 0/1 vector with probability 1. The
probability that the optimal solution is generated during the process can be made
arbitrarily close to 1 if $\alpha$ is sufficiently small.

Proof. The proof closely follows the proofs of Theorems 1-3 in [1]. We begin by
introducing some notation. Let $x^*$ denote the optimum solution, and let $\mathcal{F}_t$
denote the $\sigma$-algebra generated by knowing all random outcomes up to time step $t$.
Let $\phi_t := \Pr(x^{(t)} = x^* \mid \mathcal{F}_{t-1})$ be the probability that the
optimal solution is generated at time $t$, and
$\phi_{t,i} := \Pr(x^{(t)}_i = x^*_i \mid \mathcal{F}_{t-1})$ the probability that
component $i$ is identical to that of the optimal solution. Clearly,
$\phi_{t,i} = p_{t-1,i}\,\mathbf{1}\{x^*_i = 1\} + (1 - p_{t-1,i})\,\mathbf{1}\{x^*_i = 0\}$
and $\phi_t = \prod_{i=1}^{n} \phi_{t,i}$.

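Let $p_{t,i}^{\min}$ and $p_{t,i}^{\max}$ denote the minimum and maximum possible value of
$p_{t,i}$, respectively. In each step of the algorithms, $p_{t,i}$ is either left
unchanged or modified with stepsize $\alpha_1 := \alpha / N_e$. Consequently,

\[
p_{t,i}^{\min} = p_{0,i}\,(1 - \alpha_1)^t
\]

and

\[
\begin{aligned}
p_{t,i}^{\max}
  &= p_{0,i}\,(1 - \alpha_1)^t + \sum_{j=1}^{t} \alpha_1 (1 - \alpha_1)^{t-j} \\
  &= p_{0,i}\,(1 - \alpha_1)^t + \bigl(1 - (1 - \alpha_1)\bigr) \sum_{j=1}^{t} (1 - \alpha_1)^{t-j} \\
  &= p_{0,i}\,(1 - \alpha_1)^t + 1 - (1 - \alpha_1)^t .
\end{aligned}
\]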
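The closed forms for the extreme values can be checked numerically by applying the update rule $p \leftarrow (1 - \alpha_1)\,p + \alpha_1 x$ repeatedly with $x = 1$ (for the maximum) or $x = 0$ (for the minimum). The short Python sketch below does this; the function names and the numerical values are chosen here for illustration only.

```python
import math


def iterate_update(p0, alpha1, t, x):
    """Apply t successive updates of p towards the fixed value x (0 or 1)."""
    p = p0
    for _ in range(t):
        p = (1 - alpha1) * p + alpha1 * x
    return p


def p_min_closed(p0, alpha1, t):
    """Closed form for t consecutive updates towards 0."""
    return p0 * (1 - alpha1) ** t


def p_max_closed(p0, alpha1, t):
    """Closed form for t consecutive updates towards 1."""
    return p0 * (1 - alpha1) ** t + 1 - (1 - alpha1) ** t


if __name__ == "__main__":
    p0, alpha1, t = 0.5, 0.07, 25      # illustrative values
    print(iterate_update(p0, alpha1, t, 0.0), p_min_closed(p0, alpha1, t))
    print(iterate_update(p0, alpha1, t, 1.0), p_max_closed(p0, alpha1, t))
    assert math.isclose(iterate_update(p0, alpha1, t, 1.0),
                        p_max_closed(p0, alpha1, t))
```

Both extremes stay strictly inside $(0, 1)$ for every finite $t$, which is what the lower bound on the sampling probability of the optimum relies on.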

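Using these quantities,

\[
\begin{aligned}
\phi_{t,i}^{\min}
  &= p_{t,i}^{\min}\,\mathbf{1}\{x^*_i = 1\} + (1 - p_{t,i}^{\max})\,\mathbf{1}\{x^*_i = 0\} \\
  &= (1 - \alpha_1)^t \bigl(p_{0,i}\,\mathbf{1}\{x^*_i = 1\} + (1 - p_{0,i})\,\mathbf{1}\{x^*_i = 0\}\bigr) \\
  &= \phi_{1,i}\,(1 - \alpha_1)^t ,
\end{aligned}
\]

and

\[
\phi_t^{\min} = \prod_{i=1}^{n} \phi_{t,i}^{\min} = \phi_1\,(1 - \alpha_1)^{nt} .
\]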



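Let $E_t = \bigcap_{m=1}^{t} \{x^{(m)} \ne x^*\}$ denote the event that the optimal
solution was not generated up to time $t$. Let $\mathcal{R}_t$ denote the set of possible
values of $\phi_t$. Clearly, for all $r \in \mathcal{R}_t$, $r \ge \phi_t^{\min}$. Note
also that $\Pr(x^{(t)} = x^* \mid \phi_t = r, E_{t-1}) = r$ by the construction of the
random sampling procedure of CE. Then

\[
\begin{aligned}
\Pr(x^{(t)} = x^* \mid E_{t-1})
  &= \sum_{r \in \mathcal{R}_t} \Pr(x^{(t)} = x^* \mid \phi_t = r, E_{t-1})\,
     \Pr(\phi_t = r \mid E_{t-1}) \\
  &= \sum_{r \in \mathcal{R}_t} r \cdot \Pr(\phi_t = r \mid E_{t-1}) \\
  &\ge \phi_t^{\min} = \phi_1\,(1 - \alpha_1)^{nt} .
\end{aligned}
\]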

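Using this, we can estimate the probability that the optimum solution has not been
generated up to time step $T$:

\[
\begin{aligned}
\Pr(E_T)
  &= \Pr(E_1) \prod_{t=2}^{T} \Pr(E_t \mid E_{t-1}) \\
  &= \Pr(E_1) \prod_{t=2}^{T} \bigl(1 - \Pr(x^{(t)} = x^* \mid E_{t-1})\bigr) \\
  &\le \Pr(E_1) \prod_{t=2}^{T} \bigl(1 - \phi_1 (1 - \alpha_1)^{nt}\bigr) .
\end{aligned}
\]

Using the fact that $(1 - u) \le e^{-u}$, we obtain

\[
\begin{aligned}
\Pr(E_T)
  &\le \Pr(E_1) \prod_{t=2}^{T} \exp\bigl(-\phi_1 (1 - \alpha_1)^{nt}\bigr) \\
  &= \Pr(E_1) \exp\Bigl(-\phi_1 \sum_{t=2}^{T} (1 - \alpha_1)^{nt}\Bigr) .
\end{aligned}
\]

Let

\[
h(\alpha_1) := \sum_{t=1}^{\infty} (1 - \alpha_1)^{nt} = \frac{1}{1 - (1 - \alpha_1)^n} - 1 .
\]

With this notation,

\[
\lim_{T \to \infty} \Pr(E_T) \le \Pr(E_1) \exp\bigl(-\phi_1 h(\alpha_1)\bigr) .
\]

However, $h(\alpha_1) \to \infty$ as $\alpha_1 \to 0$, so $\lim_{T \to \infty} \Pr(E_T)$
can be made arbitrarily close to zero if $\alpha_1$ is sufficiently small.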
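To get a feeling for how small $\alpha_1$ has to be, the quantity $h(\alpha_1)$ and the factor $\exp(-\phi_1 h(\alpha_1))$ can be evaluated numerically. The Python sketch below is only an illustration: the function names `h` and `miss_factor` and the chosen values ($n = 10$ and $\phi_1 = 2^{-10}$, which corresponds to $p_0 = (1/2, \ldots, 1/2)$) are assumptions made here.

```python
import math


def h(alpha1, n):
    """Closed form of sum_{t >= 1} (1 - alpha1)^(n t)."""
    q = (1 - alpha1) ** n
    return 1 / (1 - q) - 1


def miss_factor(phi1, alpha1, n):
    """The factor exp(-phi1 * h(alpha1)) that multiplies Pr(E_1) in the bound."""
    return math.exp(-phi1 * h(alpha1, n))


if __name__ == "__main__":
    n = 10
    phi1 = 2.0 ** -n                   # all initial parameters equal to 1/2
    for alpha1 in (1e-2, 1e-3, 1e-4, 1e-5):
        print(alpha1, h(alpha1, n), miss_factor(phi1, alpha1, n))
```

For small $\alpha_1$, $h(\alpha_1)$ grows roughly like $1/(n \alpha_1)$, so the factor becomes small only once $\alpha_1$ is well below $\phi_1 / n$. This is consistent with the remark that the guarantee is rather weak.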

To prove the remaining part of the theorem, the convergence of $p_t$ to a 0/1 vector,
define $Z_{t,i} = p_{t,i} - p_{t-1,i}$. For the sake of notational convenience, we fix a
component $i$ and omit it from the indices. Note that $Z_t \ne 0$ if and only if $x^{(t)}$
is considered an elite sample. Clearly, if $x^{(t)}$ is not elite, then no probability
update is made. On the other hand, an update modifies $p_t$ towards either 0 or 1. Since
$0 < p_t < 1$ with no equality allowed, such an update does change the probabilities.
Consider the subset of time indices when probabilities are updated,
$I = \{t : Z_t \ne 0\}$. We need to show that $|I| = \infty$. This is the only part of the
proof that requires slight extra work compared to the proof of the batch variant.

We will show that each unbroken sequence of zeros in $\{Z_t\}$ is finite with probability
1. Consider such a 0-sequence that starts at time $t_1$, and suppose that it is infinite.
Then, the sampling distribution $p_t$ is unchanged for $t \ge t_1$, and so is the
distribution $F$ of the $f$-values. Let us examine the first online variant of the CEM.
Divide the interval $[t_1, \infty)$ into epochs of length $N + 1$. The contents of the
queue at time steps $t_1, t_1 + (N + 1), t_1 + 2(N + 1), \ldots$ are independent and
identically distributed, because (a) the samples are generated independently from each
other and (b) the different queues have no common elements. For a given queue $Q_t$ (with
all elements sampled from distribution $F$) and a new sample $x^{(t)}$ (also from
distribution $F$), the probability that $x^{(t)}$ is not elite is exactly $1 - \rho$.
Therefore the probability that no sample is considered elite for $t \ge t_1$ is at most
$\lim_{k \to \infty} (1 - \rho)^k = 0$.

The situation is even simpler for the memoryless variant of the online CEM: suppose
again that no sample is considered elite for $t \ge t_1$, and all samples are drawn from
the distribution $F$. $F$ is a distribution over a finite domain, so it has a finite
minimum $f_{\min}$. As all samples are considered non-elite, the elite threshold is
decreased by a constant amount $\rho\Delta$ in each step, eventually becoming smaller than
$f_{\min}$, which results in a contradiction.

So, for both online methods we can consider the (infinitely long) subsequences
$\{Z_t\}_{t \in I}$, $\{x^{(t)}\}_{t \in I}$, $\{p_t\}_{t \in I}$, etc. For the sake of
notational simplicity, we shall index these subsequences with $t = 1, 2, 3, \ldots$.

From now on, the proof continues identically to the original. We will show that $Z_t$
changes sign only a finite number of times with probability 1. To this end, let $\tau_k$
be the random iteration number when $Z_t$ changes sign for the $k$th time. For all $k$,

(1) $\tau_k = \infty \Rightarrow \tau_{k+1} = \infty$,

(2) $Z_{\tau_k} < 0 \Rightarrow p_{\tau_k} = (1 - \alpha_1)\,p_{\tau_k - 1} + \alpha_1 \cdot 0 \le (1 - \alpha_1) < 1$,

(3) $Z_{\tau_k} > 0 \Rightarrow p_{\tau_k} = (1 - \alpha_1)\,p_{\tau_k - 1} + \alpha_1 \cdot 1 \ge \alpha_1 > 0$.

From this point on, the proof of Theorem 3 in [1] can be applied without change,
showing that the number of sign changes is finite with probability 1, and then proving
that this implies convergence to either 0 or 1.

□